A Subcategorisation Lexicon for German Verbs induced from a Lexicalised PCFG
Abstract
The paper presents a large-scale computational subcategorisation lexicon for several thousand German verbs. The lexical entries were obtained by unsupervised learning in a statistical grammar framework: a German context-free grammar containing frame-predicting grammar rules and information about lexical heads was trained on 18.7 million words of a large German newspaper corpus. We developed a simple methodology that uses the frequency distributions in the lexicalised version of the probabilistic grammar to induce syntactic verb frame descriptions. The frame definition is variable with respect to the inclusion of prepositional phrase refinement. An evaluation against a manual dictionary supports the use of the machine-readable lexicon as a valuable component for NLP tasks. To our knowledge, no previous computational approach has obtained a subcategorisation lexicon for German comparable in size (the number of verbs in the lexicon), restriction (no limit concerning the frequencies of the verbs), or verified reliability (successful extensive evaluation against a dictionary).
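The abstract does not spell out how frames are derived from the frequency distributions, so the following is only a minimal sketch of one plausible reading: per-verb frame counts are turned into relative frequencies, and frames below a cutoff are discarded. The function name `induce_frames`, the cutoff value, the frame labels (`n`, `na`, `np:auf`, ...) and all counts are hypothetical and not taken from the paper. The `collapse_pp` switch illustrates the variable frame definition by dropping the preposition from prepositional-phrase frames.

```python
from collections import Counter


def induce_frames(frame_counts, cutoff=0.05, collapse_pp=False):
    """Turn one verb's frame counts into a filtered frame distribution.

    frame_counts: dict mapping a frame label to its frequency.
    cutoff:       assumed minimum relative frequency for keeping a frame.
    collapse_pp:  if True, merge PP-refined frames such as "np:auf"
                  into the unrefined frame "np".
    """
    counts = Counter()
    for frame, freq in frame_counts.items():
        # Hypothetical label convention: "<frame>:<preposition>".
        label = frame.split(":")[0] if collapse_pp else frame
        counts[label] += freq

    total = sum(counts.values())
    if total == 0:
        return {}

    # Keep only frames whose relative frequency reaches the cutoff.
    kept = {f: c / total for f, c in counts.items() if c / total >= cutoff}

    # Renormalise so the remaining frames form a probability distribution.
    norm = sum(kept.values())
    return {f: p / norm for f, p in kept.items()}


# Made-up counts for a single verb, for illustration only.
example_counts = {"n": 10, "na": 80, "nad": 8, "np:auf": 2}
lexicon_entry = induce_frames(example_counts, cutoff=0.05)
```

With these made-up counts, the rare `np:auf` frame falls below the cutoff and is dropped, while the remaining three frames are renormalised. A fixed cutoff is only one possible filter; the paper may use a different criterion.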
Similar resources
Subcategorisation Acquisition from Raw Text for a Free Word-Order Language
We describe a state-of-the-art automatic system that can acquire subcategorisation frames from raw text for a free word-order language. We use it to construct a subcategorisation lexicon of German verbs from a large Web page corpus. With an automatic verb classification paradigm we evaluate our subcategorisation lexicon against a previous classification of German verbs; the lexicon produced by ...
Inducing German Semantic Verb Classes from Purely Syntactic Subcategorisation Information
The paper describes the application of kMeans, a standard clustering technique, to the task of inducing semantic classes for German verbs. Using probability distributions over verb subcategorisation frames, we obtained an intuitively plausible clustering of 57 verbs into 14 classes. The automatic clustering was evaluated against independently motivated, hand-constructed semantic verb classes. A ...
The Lexicon-Grammar Balance in Robust Parsing of Italian
What is the role of lexical information in robust parsing of unrestricted texts? In this paper we provide experimental evidence showing that, in order to strike the balance between robustness and coverage needed for practical NLP applications, judicious use of positive lexical evidence given a text should be complemented with a battery of dynamic parsing strategies aimed at solving local constr...
Smoothing fine-grained PCFG lexicons
We present an approach for smoothing treebank-PCFG lexicons by interpolating treebank lexical parameter estimates with estimates obtained from unannotated data via the inside-outside algorithm. The PCFG has complex lexical categories, making relative-frequency estimates from a treebank very sparse. This kind of smoothing for complex lexical categories results in improved parsing performance, wi...
Subcategorization Acquisition and Evaluation for Chinese Verbs
This paper describes the technology for, and an experiment in, subcategorization acquisition for Chinese verbs. The SCF hypotheses are generated by means of linguistic heuristic information and filtered via statistical methods. Evaluation on the acquisition of 20 multi-pattern verbs shows that our experiment achieved precision and recall similar to those of previous research. Besides, simple application...
Publication date: 2002